Mining Co-Occurrence Matrices for SO-PMI Paradigm Word Candidates

Author

  • Aleksander Wawer
Abstract

This paper focuses on one aspect of SO-PMI, an unsupervised approach to sentiment vocabulary acquisition proposed by Turney (Turney and Littman, 2003). The method, originally applied and evaluated for English, is often used to bootstrap sentiment lexicons for European languages, for which such resources typically do not exist. In general, SO-PMI values are computed from word co-occurrence frequencies in the neighbourhoods of two small sets of paradigm words. The goal of this work is to investigate how lexeme selection affects the quality of the resulting sentiment estimates. To this end, ad hoc random lexeme selection is compared with two alternative heuristics, based on clustering and on singular value decomposition (SVD) of a word co-occurrence matrix; the latter two methods prove superior. The work can also be interpreted as a sensitivity analysis of SO-PMI with regard to paradigm word selection. The experiments were carried out for Polish.
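
To make the computation concrete, here is a minimal sketch of SO-PMI in the general form given by Turney and Littman (2003): the summed PMI of a word with the positive paradigm words minus its summed PMI with the negative ones. This is not the paper's implementation. The function names, the window size, the additive smoothing constant, and the toy English corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count unigram frequencies and symmetric co-occurrences.

    Two tokens co-occur when they are at most `window` positions apart
    (the window size is an illustrative assumption).
    """
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            pairs[(w, v)] += 1
            pairs[(v, w)] += 1
    return unigrams, pairs


def pmi(w1, w2, unigrams, pairs, total, smoothing=0.01):
    """PMI estimated from counts: log2(c(w1, w2) * N / (c(w1) * c(w2))).

    A small additive constant (an assumption) keeps the logarithm
    finite when the two words never co-occur.
    """
    if unigrams[w1] == 0 or unigrams[w2] == 0:
        return 0.0  # unseen words contribute nothing
    joint = pairs[(w1, w2)] + smoothing
    return math.log2(joint * total / (unigrams[w1] * unigrams[w2]))


def so_pmi(word, positive, negative, unigrams, pairs, total):
    """Sum of PMI with positive paradigm words minus sum with negative ones."""
    pos = sum(pmi(word, p, unigrams, pairs, total) for p in positive)
    neg = sum(pmi(word, n, unigrams, pairs, total) for n in negative)
    return pos - neg


# Toy corpus (hypothetical): "nice" appears next to "good", "awful" next to "bad".
tokens = "nice good nice good awful bad awful bad".split()
unigrams, pairs = cooccurrence_counts(tokens, window=1)
score_nice = so_pmi("nice", ["good"], ["bad"], unigrams, pairs, len(tokens))
score_awful = so_pmi("awful", ["good"], ["bad"], unigrams, pairs, len(tokens))
print(score_nice > 0, score_awful < 0)
```

Because each score depends directly on which words populate the `positive` and `negative` lists, changing those lists changes every estimate, which is the sensitivity to paradigm word selection that the abstract describes.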

Similar resources

Improving Pointwise Mutual Information (PMI) by Incorporating Significant Co-occurrence

We design a new co-occurrence-based word association measure by incorporating the concept of significant co-occurrence into the popular word association measure Pointwise Mutual Information (PMI). Through extensive experiments with a large number of publicly available datasets, we show that the newly introduced measure performs better than other co-occurrence-based measures and despite being resource-li...

Co-Occurrence-Based Error Correction Approach to Word Segmentation

To overcome the problems in Thai word segmentation, a number of word segmentation approaches have been proposed over a long period of time. We propose a novel Thai word segmentation approach called Co-occurrence-Based Error Correction (CBEC). CBEC generates all possible segmentation candidates using the classical maximal matching algorithm and then selects the most accurate segmentation ...

2018 Formatting Instructions for Authors Using LaTeX

Word embedding models such as GloVe rely on co-occurrence statistics from a large corpus to learn vector representations of word meaning. These vectors have proven to capture surprisingly fine-grained semantic and syntactic information. While we may similarly expect that co-occurrence statistics can be used to capture rich information about the relationships between different words, existing app...

Using Filtered Second Order Co-occurrence Matrix to Improve the Traditional Co-occurrence Model

Using co-occurrence statistics to measure word similarity or relatedness has applications in many areas of natural language processing. Our experimental results also indicate that two words with zero co-occurrence statistics could still be related. In this paper, we present two algorithms, both of which were evaluated on 80 synonym test questions from the Test of English as a Foreign Language (TOE...

Lexical Co-occurrence, Statistical Significance, and Word Association

Lexical co-occurrence is an important cue for detecting word associations. We present a theoretical framework for discovering statistically significant lexical co-occurrences from a given corpus. In contrast with the prevalent practice of giving weightage to unigram frequencies, we focus only on the documents containing both the terms (of a candidate bigram). We detect biases in span distributi...

Publication date: 2012